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Abstract: This paper discusses the potential of graphics processing units 
(GPUs) in high-dimcnsional optimization problems. A single GPU card 
with hundreds of arithmetic cores can be inserted in a personal computer 
and dramatically accelerates many statistical algorithms. To exploit these 
devices fully, optimization algorithms should reduce to multiple parallel 
tasks, each accessing a limited amount of data. These criteria favor EM 
and MM algorithms that separate parameters and data. To a lesser extent 
block relaxation and coordinate descent and ascent also qualify. We demon- 
strate the utility of GPUs in nonnegativc matrix factorization, PET image 
reconstruction, and multidimensional scaling. Spcedups of 100 fold can eas- 
ily be attained. Over the next decade, GPUs will fundamentally alter the 
landscape of computational statistics. It is time for more statisticians to 
got on-board. 
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1. Introduction 

Statisticians, like all scientists, are acutely aware that the clock speeds on their 
desktops and laptops have stalled. Does this mean that statistical computing 
has hit a wall? The answer fortunately is no, but the hardware advances that 
we routinely expect have taken an interesting detour. Most computers now sold 
have two to eight processing cores. Think of these as separate CPUs on the 
same chip. Naive programmers rely on sequential algorithms and often fail to 
take advantage of more than a single core. Sophisticated programmers, the kind 
who work for commercial firms such as Matlab, eagerly exploit parallel program- 
ming. However, multicore CPUs do not represent the only road to the success 
of statistical computing. 

Graphics processing units (GPUs) have caught the scientific community by 
surprise. These devices are designed for graphics rendering in computer anima- 
tion and games. Propelled by these nonscientific markets, the old technology of 
numerical (array) coprocessors has advanced rapidly. Highly parallel GPUs are 
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now making computational inroads against traditional CPUs in image process- 
ing, protein folding, stock options pricing, robotics, oil exploration, data mining, 
and many other areas [27]. We are starting to see orders of magnitude improve- 
ment on some hard computational problems. Three companies, Intel, NVIDIA, 
and AMD/ATI, dominate the market. Intel is struggling to keep up with its 
more nimble competitors. 

Modern CPUs support more vector and matrix operations, stream data 
faster, and possess more local memory per core than their predecessors. They 
are also readily available as commodity items that can be inserted as video 
cards on modern PCs. CPUs have been criticized for their hostile program- 
ming environment and lack of double precision arithmetic and error correction, 
but these faults are being rectified. The CUDA programming environment [26] 
for NVIDIA chips is now easing some of the programming chores. We could 
say more about near-term improvements, but most pronouncements would be 
obsolete within months. 

Oddly, statisticians have been slow to embrace the new technology. Silberstein 
et al [30] first demonstrated the potential for CPUs in fitting simple Bayesian 
networks. Recently Suchard and Rambaut [32] have seen greater than 100-fold 
speed-ups in MCMC simulations in molecular phylogeny. Lee et al [17] and Tib- 
bits et al [33] are following suit with Bayesian model fitting via particle filtering 
and slice sampling. Finally, work is under-way to port common data mining 
techniques such as hierarchical clustering and multi-factor dimensionality re- 
duction onto CPUs [31]. These efforts constitute the first wave of an eventual 
flood of statistical and data mining applications. The porting of GPU tools into 
the R environment will undoubtedly accelerate the trend [3]. 

Not all problems in computational statistics can benefit from CPUs. Sequen- 
tial algorithms are resistant unless they can be broken into parallel pieces. Even 
parallel algorithms can be problematic if the entire range of data must be ac- 
cessed by each GPU. Because they have limited memory, CPUs are designed to 
operate on short streams of data. The greatest speedups occur when all of the 
CPUs on a card perform the same arithmetic operation simultaneously. Effec- 
tive applications of CPUs in optimization involves both separation of data and 
separation of parameters. 

In the current paper, we illustrate how CPUs can work hand in glove with 
the MM algorithm, a generalization of the EM algorithm. In many optimization 
problems, the MM algorithm explicitly separates parameters by replacing the 
objective function by a sum of surrogate functions, each of which involves a single 
parameter. Optimization of the one-dimensional surrogates can be accomplished 
by assigning each subproblem to a different core. Provided the different cores 
each access just a slice of the data, the parallel subproblems execute quickly. 
By construction the new point in parameter space improves the value of the 
objective function. In other words, MM algorithms are iterative ascent or descent 
algorithms. If they are well designed, then they separate parameters in high- 
dimensional problems. This is where CPUs enter. They offer most of the benefits 
of distributed computer clusters at a fraction of the cost. For this reason alone, 
computational statisticians need to pay attention to CPUs. 
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Fig 1. Left: The Rosenbrock (banana) function (the lower surface) and a majorization func- 
tion at point (-1,-1) (the upper surface). Right: MM iterates. 

Before formally defining the MM algorithm, it may help the reader to walk 
through a simple numerical example stripped of statistical content. Consider 
the Rosenbrock test function 

/(x) = lOOixl - X2f + [xi - if (1.1) 
= 100(xf + a;^-2a;?a;2) + (x?- 2x1 + 1), 

familiar from the minimization literature. As we iterate toward the minimum 
at X = 1 = (1, 1); we construct a surrogate function that separates parameters. 
This is done by exploiting the obvious majorization 

-2x1x2 < + X2 + (x^i + a;„2)^ - 2(x^i + a;„2)(2^? + a;2), 

where equality holds when x and the current iterate x„ coincide. It follows that 
/(x) itself is majorized by the sum of the two surrogates 

gi{xi I x„) = 200xt - [200(^2 1 + x„2) - -2x^ + 1 
92{x2 I x„) = 200x1 - 200(a;^i + Xn2)x2 + (a;^a + x^^f . 

The left panel of Figure 1 depicts the Rosenbrock function and its majorization 
gi{xi \ x„) + g2{x2 I x„) at the point -1. 

According to the MM recipe, at each iteration one must minimize the quartic 
polynomial gi{xi \ x„) and the quadratic polynomial (72 (2^2 | x„). The quartic 
possesses either a single global minimum or two local minima separated by a 
local maximum These minima are the roots of the cubic function g'-^^{xi\x.n) and 
can be explicitly computed. We update xi by the root corresponding to the 
global minimum and X2 via Xn-^1.2 = ■^{Xni + Xn2)- The right panel of Figure 1 
displays the iterates starting from xo = —1. These immediately jump into the 
Rosenbrock valley and then slowly descend to 1. 

Separation of parameters in this example makes it easy to decrease the ob- 
jective function. This almost trivial advantage is amplified when we optimize 
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functions depending on tens of thousands to millions of parameters. In these set- 
tings, Newton's method and variants such as Fisher's scoring are fatally handi- 
capped by the need to store, compute, and invert huge Hessian or information 
matrices. On the negative side of the balance sheet, MM algorithms are of- 
ten slow to converge. This disadvantage is usually outweighed by the speed of 
their updates even in sequential mode. If one can harness the power of parallel 
processing CPUs, then MM algorithms become the method of choice for many 
high-dimensional problems. 

We conclude this introduction by sketching a roadmap to the rest of the 
paper. Section 2 reviews the MM algorithm. Section 3 discusses three high- 
dimensional MM examples. Although the algorithm in each case is known, we 
present brief derivations to illustrate how simple inequalities drive separation 
of parameters. We then implement each algorithm on a realistic problem and 
compare running times in sequential and parallel modes. We purposefully omit 
programming syntax since many tutorials already exist for this purpose, and 
material of this sort is bound to be ephemeral. Section 4 concludes with a 
brief discussion of other statistical applications of CPUs and other methods of 
accelerating optimization algorithms. 

2. MM Algorithms 

The MM algorithm like the EM algorithm is a principle for creating optimiza- 
tion algorithms. In minimization the acronym MM stands for majorization- 
minimization; in maximization it stands for minorization-maximization. Both 
versions are convenient in statistics. For the moment we will concentrate on 
maximization. 

Let f{0) be the objective function whose maximum we seek. Its argument 
9 can be high-dimensional and vary over a constrained subset Q of Euclidean 
space. An MM algorithm involves minorizing f{6) by a surrogate function g{6 \ 
On) anchored at the current iterate 0„ of the search. The subscript n indicates 
iteration number throughout this article. If On+i denotes the maximum of g{0 \ 
6n) with respect to its left argument, then the MM principle declares that On+i 
increases f{6) as well. Thus, MM algorithms revolve around a basic ascent 
property. 

Minorization is defined by the two properties 



In other words, the surface ^ g{6 \ On) lies below the surface 6 H- f{9) and 
is tangent to it at the point 6 = On- Construction of the minorizing function 
g{0 I 0„ ) constitutes the first M of the MM algorithm. In our examples g{0 \ On) 
is chosen to separate parameters. 

In the second M of the MM algorithm, one maximizes the surrogate g{0 \ On) 
rather than f{0) directly. It is straightforward to show that the maximum point 



f{0„) = g{On\0„) 
f{0) > g{0\On), 



0^0 



(2.1) 
(2.2) 
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6n+i satisfies the ascent property /(0„+i) > f{On)- The proof 

f{On+l) > 9{0n+l I 0n) > 9{0n \ On) = f{0n) 

reflects definitions (2.1) and (2.2) and the choice of 9n+i- The ascent property 
is the source of the MM algorithm's numerical stability and remains valid if we 
merely increase g{9 \ On) rather than maximize it. In many problems MM up- 
dates are delightfully simple to code, intuitively compelling, and automatically 
consistent with parameter constraints. In minimization we seek a majorizing 
function g(d \ 9n) lying above the surface H> f{0) and tangent to it at the 
point = 6n- Minimizing g{0 \ On) drives f{6) downhill. 

The celebrated Expectation-Maximization (EM) algorithm [7, 21] is a special 
case of the MM algorithm. The Q-function produced in the E step of the EM 
algorithm constitutes a minorizing function of the loglikelihood. Thus, both 
EM and MM share the same advantages: simplicity, stability, graceful adap- 
tation to constraints, and the tendency to avoid large matrix inversion. The 
more general MM perspective frees algorithm derivation from the missing data 
straitjacket and invites wider applications. For example, our multi-dimensional 
scaling (MDS) and non-negative matrix factorization (NNFM) examples involve 
no likelihood functions. Wu and Lange [37] briefly summarize the history of the 
MM algorithm and its relationship to the EM algorithm. 

The convergence properties of MM algorithms are well-known [15]. In partic- 
ular, five properties of the objective function f{0) and the MM algorithm map 
M- M{0) guarantee convergence to a stationary point of f{0): (a) f{0) is coer- 
cive on its open domain; (b) f{0) has only isolated stationary points; (c) M{0) 
is continuous; (d) 9* is a fixed point of M (0) if and only if 0* is a stationary 
point of f{0); and (e) f[M{0*)] > f{0*), with equality if and only if 0* is a 
fixed point of M{0). These conditions are easy to verify in many applications. 
The local rate of convergence of an MM algorithm is intimately tied to how well 
the surrogate function g{0 \ 9*) approximates the objective function f{0) near 
the optimal point 0* . 

3. Numerical Examples 

In this section, we compare the performances of the CPU and GPU implementa- 
tions of three classical MM algorithms coded in C-I--I-: (a) non-negative matrix 
factorization (NNMF), (b) positron emission tomography (PET), and (c) mul- 
tidimensional scaling (MDS). In each case we briefly derive the algorithm from 
the MM perspective. For the CPU version, we iterate until the relative change 

\f{en)-f{On-l)\ 
\f{On-l)\+l 

of the objective function f{0) between successive iterations falls below a pre-set 
threshold e or the number of iterations reaches a pre-set number n,„ax, whichever 
comes flrst. In these examples, we take e = 10~^ and Jimax = 100,000. For ease 
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CPU 


GPU 


Model 


Intel Core 2 


NVIDIA GeForce 




Extreme X9440 


GTX 280 


# Cores 


4 


240 


Clock 


3.2G 


1.3G 


Memory 


16G 


IG 



Table 1 

Configuration of the desktop system 



of comparison, we iterate the GPU version for the same number of steps as the 
CPU version. Overall, we see anywhere from a 22-fold to 112-fold decrease in 
total run time. The source code is freely available from the first author. 

Table 1 shows how our desktop system is configured. Although the CPU 
is a high-end processor with four cores, we use just one of these for ease of 
comparison. In practice, it takes considerable effort to load balance the various 
algorithms across multiple CPU cores. With 240 GPU cores, the GTX 280 GPU 
card delivers a peak performance of about 933 GFlops in single precision. This 
card is already obsolete. Newer cards possess twice as many cores, and up to four 
cards can fit inside a single desktop computer. It is relatively straightforward to 
program multiple CPUs. Because previous generation GPU hardware is largely 
limited to single precision, this is a worry in scientific computing. To assess 
the extent of roundoff error, we display the converged values of the objective 
functions to ten significant digits. Only rarely is the GPU value far off the CPU 
mark. Finally, the extra effort in programming the GPU version is relatively 
light. Exploiting the standard CUDA library [26], it takes 77, 176, and 163 
extra lines of GPU code to implement the NNMF, PET, and MDS examples, 
respectively. 

3.1. Non-Negative Matrix Factorizations 

Non-negative matrix factorization (NNMF) is an alternative to principle com- 
ponent analysis useful in modeling, compressing, and interpreting nonnegative 
data such as observational counts and images. The articles [2, 18, 19] discuss in 
detail algorithm development and statistical applications of NNMF. The basic 
problem is to approximate a data matrix X with nonnegative entries Xij by a 
product VW of two low rank matrices V and W with nonnegative entries Vik 
and Wkj ■ Here X, V, and W are pxq,pxr, and rxq, respectively, with r much 
smaller than min{p, q}. One version of NNMF minimizes the objective function 

/(V,W) = ||X-VW||| = (3.1) 

i j k 

where || ■ ||f denotes the Frobenius-norm. To get an idea of the scale of NNFM 
imaging problems, p (number of images) can range 10^ — 10''', q (number of 
pixels per image) can surpass 10^ — 10^, and one seeks a rank r approximation 
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of about 50. Notably, part of the winning solution of the Netflix challenge relies 
on variations of NNMF [12]. For the Netflix data matrix, p = 480,000 (raters), 
q = 18,000 (movies), and r ranged from 20 to 100. 

Exploiting the convexity of the function x i— >■ {xij — x)'^, one can derive the 
inequality 

2 /hi \ 2 



k k 



7 Xij VikWk] 

^nij \ ^nikj 



where anikj = VnikWnkj and bnij = Y^k^^^ikJ- "^^^^ leads to the surrogate func- 
tion 

5(V,W|V„,W„) = ^^^^fa;,,_^™.,Y (3.2) 



■nikj 

majorizing the objective function /(V, W) = ||X — VW|||. Although the ma- 
jorization (3.2) does not achieve a complete separation of parameters, it docs if 
we fix V and update W or vice versa. This strategy is called block relaxation. 

If we elect to minimize g(V, W | V„, W„) holding W fixed at W„, then the 
stationarity condition for V reads 

-^5(V, W„ I V„, W„) = -2y^( Xij ^-^VikWnkj]wnkj = 0. 



J 



Its solution furnishes the simple multiplicative update 



Ylij Xij'U^nkj 

Vn+l,ik = Vmk^^^ • (3.3) 

/ j j OnijWjikj 



Likewise the stationary condition 
d 



3(V„+i,W I V„+i,W,0 = 

OWkj 

gives the multiplicative update 

^i XijVn+l,ik ,„ 
Wn+l,kj = Wnkj^ , (3.4) 

where Cnij — J2k'^n+i,ikWnkj- Close inspection of the multiplicative updates 
(3.3) and (3.4) shows that their numerators depend on the matrix products 
XW*j and V^^j^X and their denominators depend on the matrix products 
VnW„W^ and V^^;^V„+iW„. Large matrix multiplications are very fast on 
GPUs because CUDA implements in parallel the BLAS (basic linear algebra sub- 
programs) library widely applied in numerical analysis [25]. Once the relevant 
matrix products are available, each elementwise update of Vik or Wkj involves 
just a single multiplication and division. These scalar operations are performed 
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Algorithm 1 (NNMF) Given X e W^""^, find V e M^'"' and W e 
minimizing ||X — VW|||. 

Initialize: Draw ijQife and wqi^j uniform on (0,1) for all 1 < j < p, 1 < fc < r, 1 < i < 9 
repeat 

Compute XW* and V„W„W;^ 

Vn+l,ik ^ t'n^fe ■ {XW^},fe / {V„W„W*„},fe for alll < i < p, 1 < fc < r 
Compute Vj^_^j^X and V^^j^V„+iW„ 

«'n+i,*:i ^ w^kj ■ {V^^-^Xjfej / {V^^-^V„+iW„}fej for all 1 < fc < r, 1 < J < g 
until convergence occurs 



CPU GPU 



Rank r 


Iters 


Time 


Function 


Time 


Function 


Speedup 


10 


25459 


1203 


106.2653503 


55 


106.2653504 


22 


20 


87801 


7564 


89.56601262 


163 


89.56601287 


46 


30 


55783 


7013 


78.42143486 


103 


78.42143507 


68 


40 


47775 


7880 


70.05415929 


119 


70.05415950 


66 


50 


53523 


11108 


63.51429261 


121 


63.51429219 


92 


60 


77321 


19407 


58.24854375 


174 


58.24854336 


112 



Table 2 

Run-time (in seconds) comparisons for NNMF on the MIT CBCL face image data. The 
dataset contains p = 2, 429 faces with g = 19 X 19 = 361 pixels per face. The columns labeled 
Function refer to the converged value of the objective function. 



in parallel through hand-written GPU code. Algorithm 1 summarizes the steps 
in performing NNMF. 

We now compare CPU and GPU versions of the multiplicative NNMF al- 
gorithm on a training set of face images. Database ^1 from the MIT Center 
for Biological and Computational Learning (CBCL) [24] reduces to a matrix 
X containing p = 2,429 gray scale face images with g = 19 x 19 = 361 pixels 
per face. Each image (row) is scaled to have mean and standard deviation 0.25. 
Figure 2 shows the recovery of the first face in the database using a rank r = 49 
decomposition. The 49 basis images (rows of W) represent different aspects of 
a face. The rows of V contain the coefficients of these parts estimated for the 
various faces. Some of these facial features are immediately obvious in the re- 
construction. Table 2 compares the run times of Algorithm 1 implemented on 
our CPU and GPU respectively. We observe a 22 to 112-fold speed-up in the 
GPU implementation. Run times for the GPU version depend primarily on the 
number of iterations to convergence and very little on the rank r of the approx- 
imation. Run times of the CPU version scale linearly in both the number of 
iterations and r. 

It is worth stressing a few points. First, the objective function (3.1) is convex 
in V for W fixed, and vice versa but not jointly convex. Thus, even though 
the MM algorithm enjoys the descent property, it is not guaranteed to find 
the global minimum [2]. There are two good alternatives to the multiplicative 
algorithm. First, pure block relaxation can be conducted by alternating least 
squares (ALS). In updating V with W fixed, ALS omits majorization and solves 
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Fig 2. Approximation of a face image by rank-49 NNMF: coefficients X basis images = 
approximate image. 



the p separated nonnegative least square problems 

min ||X(i,:) - V(i,:)W]||^ subject to V(i, :) > 0, 

V(i,:) 

where V(i, :) and X(i, :) denote the i-th row of the corresponding matrices. 
Similarly, in updating W with V fixed, ALS solves q separated nonnegative 
least square problems. Another possibility is to change the objective function to 



i i k k 



VikWkj 



according to a Poisson model for the counts Xij [18]. This works even when 
some entries Xij fail to be integers, but the Poisson loglikelihood interpretation 
is lost. A pure MM algorithm for maximizing L(V, W) is 



XijWnkj/bnij Y.I XijVnik/bmj 
Vn+l.ik — Vnik\ , Wn+l,ij — Wnkj \ • 

Derivation of these variants of Lee and Seung's [18] Poisson updates is left to 
the reader. 



3.2. Positron Emission Tomography 

The field of computed tomography has exploited EM algorithms for many years. 
In positron emission tomography (PET) , the reconstruction problem consists of 
estimating the Poisson emission intensities A = (Ai, . . . , Ap) of p pixels arranged 
in a 2-dimensional grid surrounded by an array of photon detectors. The ob- 
served data are coincidence counts (j/i, . . . yd) along d lines of fiight connecting 
pairs of photon detectors. The loglikelihood under the PET model is 

LW ^ E ( E ^ E ^'j^j ' 

i 3 3 
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where the Cij are constants derived from the geometry of the grid and the 
detectors. Without loss of generahty, one can assume J2i^ij = 1 ^ach j. 
It is straightforward to derive the traditional EM algorithm [13, 36] from the 
MM perspective using the concavity of the function Ins. Indeed, application of 
Jensen's inequality produces the minorization 

i(A) > ^y,^«;„,,-ln(^)-^^e,,A, = g(A | A„), 

I J ■' I 'J 

where Wnij = A„j/(^j. eifcA„fc). This maneuver again separates parameters. 
The stationarity conditions for the surrogate Q(A | A„) supply the parallel 
updates 

X _ Si Vi^nij ,„ ^^ 

The convergence of the PET algorithm (3.5) is frustratingly slow, even under 
systematic acceleration [29, 39]. Furthermore, the reconstructed images are of 
poor quality with a grainy appearance. The early remedy of premature halting of 
the algorithm cuts computational cost but is entirely ad hoc, and the final image 
depends on initial conditions. A better option is add a roughness penalty to the 
loglikelihood. This device not only produces better images but also accelerates 
convergence. Thus, we maximize the penalized loglikelihood 

/(A) = ^A)-| (A. -A..)' (3.6) 

{j-k}eN- 

where /j, is the roughness penalty constant, and Af is the neighborhood system 
that pairs spatially adjacent pixels. An absolute value penalty is less likely to 
deter the formation of edges than a square penalty, but it is easier to deal with 
a square penalty analytically, and we adopt it for the sake of simplicity. In 
practice, visual inspection of the recovered images guides the selection of the 
roughness penalty constant ^. 

To maximize /(A) by an MM algorithm, we must minorize the penalty in a 
manner consistent with the separation of parameters. In view of the evenness 
and convexity of the function s'^ , we have 

(Aj — Afc)^ < 2^"^^^ ^ ^"-^ ^ ^nkf + 2(^Aa; — Xnj — A„fe)^. 

Equality holds if A^ + Afe = \nj + Xnk^ which is true when A = A„. Combining 
our two minorizations furnishes the surrogate function 

5(A I A„) = Q(A I A„) [(2Aj - - X^k? + {2Xu - A,„ - X^kf 

To maximize g{X \ A„); we define Mj = {k : {i,k} G TV} and set the partial 
derivative 

A,(A I A„) = Y ' ^^^] " ^ E (2A, - A„, - A„..) (3.7) 
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equal to and solve for Xn+i,j- Multiplying equation (3.7) by Xj produces a 
quadratic with roots of opposite signs. We take the positive root 



Xn + l.j 

where 

aj = ^2^^ 1, bnj = ^ (A,y + A„A:) - 1, Cnj = y^^UiWriij- 

k£j\fj k£j\fj i 

Algorithm 2 summarizes the complete MM scheme. Obviously, complete param- 
eter separation is crucial. The quantities aj can be computed once and stored. 
The quantities bnj and c„j are computed for each j in parallel. To improve GPU 
performance in computing the sums over z, we exploit the widely available par- 
allel sum-reduction techniques [30]. Given these results, a specialized but simple 
GPU code computes the updates \n+i,j for each j in parallel. 

Table 3 compares the run times of the CPU and GPU implementations for 
a simulated PET image [29] . The image as depicted in the top of Figure 3 has 
p = 64 X 64 = 4, 096 pixels and is interrogated by d = 2, 016 detectors. Overall 
we see a 43- to 53-fold reduction in run times with the GPU implementation. 
Figure 3 displays the true image and the estimated images under penalties of 
^ = 0, 10^^, 10^®, and 10^^. Without penalty [fi = 0), the algorithm fails to 
converge in 100,000 iterations. 



Algorithm 2 (PET Image Recovering) Given the coefficient matrix E G M!^^^, 
coincident counts y = {yi, . . . ,yd) E Z*^, and roughness parameter ^ > 0, 
find the intensity vector A = (Ai, . . . , Ap) G that maximizes the objective 
function (3.6). 

Scale E to have unit li column norms. 

Compute \Afj\ = J2k:{j,k}ej\r^ '^"'^ '^j ~ '^t^l-^jl ^'^^ 1 < i < P- 

Initialize: Xoj <~ 1, j = I, . . . ,p. 

repeat 

2nij <- {yieij\nj)/{T,k ^ik^nk) for all 1 < j < d, 1 < j < p 

for J = 1 to p do 

Kj *r- p.{\Mj\\nj + HkeMj -^nk) " 1 

^ {-bnj - ^h\- - AajC„j)/{2aj) 

end for 
until convergence occurs 



3.3. Multidimensional Scaling 

Multidimensional scaling (MDS) was the first statistical application of the 
MM principle [5, 6]. MDS represents q objects as faithfully as possible in p- 
dimensional space given a nonnegative weight wtj and a nonnegative dissimi- 
larity measure ytj for each pair of objects i and j. If 0* G W is the position 



2ai 



4a 



CPU GPU QN{10) on CPU 



Penalty fi 


Iters 


Time 


Function 


Iters 


Time 


Function 


Speedup 


Iters 


Time 


Function 


Speedup 





100000 


14790 


-7337.152765 


100000 


282 


-7337.153387 


52 


6549 


2094 


-7320.100952 


n/a 


10-^ 


24457 


3682 


-8500.083033 


24457 


70 


-8508.112249 


53 


251 


83 


-8500.077057 


44 


10-6 


6294 


919 


-15432.45496 


6294 


18 


-15432.45586 


51 


80 


29 


-15432.45366 


32 


10-5 


589 


86 


-55767.32966 


589 


2 


-55767.32970 


43 


19 


9 


-55767.32731 


10 



Table 3. Comparison of run times (in seconds) for a PET imaging problem on the simulated data in [29]. The image has p = 64 X 64 = 4,096 
pixels and is interrogated by d = 2, 016 detectors. The columns labeled Function refer to the converged value of the objective function. The results 
under the heading QN{10) on CPU invoke quasi-Newton acceleration [39] with 10 secant conditions. 
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of object i, then the p q parameter matrix 6 — {9^ , . . . , 9'^) is estimated by 
minimizing the stress function 



f{9) = Y. ^rAvr^-W-e'W? 

l<i<j<q 

= E 

i<j 



(3.8) 



Wijvfj 



i<j 



where — 9^] is the Euchdean distance between 9^ and 9-' . The stress func- 
tion (3.8) is invariant under translations, rotations, and reflections of W. To 
avoid translational and rotational ambiguities, we take 9^ to be the origin and 
the first p—l coordinates of 9^ to be 0. Switching the sign of 9^ leaves the stress 
function invariant. Hence, convergence to one member of a pair of reflected min- 
ima immediately determines the other member. 

Given these preliminaries, we now review the derivation of the MM algorithm 
presented in [16]. Because we want to minimize the stress, we majorize it. The 
middle term in the stress (3.8) is majorized by the Cauchy-Schwartz inequality 



\9' -9^ 



< - 



{9^-9=n9],-9i) 
\\9l-9i\\ 



To separate the parameters in the summands of the third term of the stress, 
we invoke the convexity of the Euclidean norm || • || and the square function . 
These maneuvers yield 



10 ' -0^11 2 



1 r 
2 



29'-{9^^ + 9i 



< 2 



29^-{9i + 9i) 



Assuming that Wij = Wji and yij = yji, the surrogate function therefore becomes 



5(0 I 0„) = 2Y,w,, 



= 

2=1 ji^l L 



9^ - -(91 



9' - -(91 
2^ 



2 y,,{9y{9l,-9i) 

2 y,^i9y{9l,-9iJ 

" w.,,y,,{9r{9l-9i, 
\\9l-9i\\ 



up to an irrelevant constant. Setting the gradient of the surrogate equal to the 
vector produces the parallel updates 



2 



j^t ^^3 
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for all movable parameters B^. 

Algorithm 3 summarizes the parallel organization of the steps. Again the 
matrix multiplications 0^0„ and 0„(W — Z„) can be taken care of by the 
CUBLAS library [25]. The remaining steps of the algorithm are conducted by 
easily written parallel code. 

Table 4 compares the run times in seconds for MDS on the 2005 United 
States House of Representatives roll call votes. The original data consist of the 
671 roll calls made by 401 representatives. We refer readers to the reference [8] 
for a careful description of the data and how the MDS input 401 x 401 distance 
matrix is derived. The weights Wij are taken to be 1. In our notation, the 
number of objects (House Representatives) is q = 401. Even for this relatively 
small dataset, we see a 27-48 fold reduction in total run times, depending on the 
projection dimension p. Figure 4 displays the results in p = 3 dimensional space. 
The Democratic and Republican members are clearly separated. For p ~ 30, the 
algorithm fails to converge within 100,000 iterations. 

Although the projection of points into p > 3 dimensional spaces may sound 
artificial, there are situations where this is standard practice. First, MDS is fore- 
most a dimension reduction tool, and it is desirable to keep p > 3 to maximize 
explanatory power. Second, the stress function tends to have multiple local min- 
ima in low dimensions [9]. A standard optimization algorithm like MM is only 
guaranteed to converge to a local minima of the stress function. As the number 
of dimensions increases, most of the inferior modes disappear. One can formally 
demonstrate that the stress has a unique minimum when p = g — 1 [4, 9]. In 
practice, uniqueness can set in well before p reaches q — I. In the recent work 
[38], we propose a "dimension crunching" technique that increases the chance 
of the MM algorithm converging to the global minimum of the stress function. 
In dimension crunching, we start optimizing the stress in a Euclidean space 
with m > p. The last m — p components of each column 6^ are gradually sub- 
jected to stiffer and stiffer penalties. In the limit as the penalty tuning parameter 
tends to oo, we recover the global minimum of the stress in MP. This strategy 
inevitably incurs a computational burden when m is large, but the MM-I-GPU 
combination comes to the rescue. 



Algorithm 3 (MDS) Given weights W and distances Y e M'?^', find the matrix 
© = [0\ ...,9''] e M.P'^'' which minimizes the stress (3.8). 

Precompute: Xij <— WijPij for all I < i, j < q 
Precompute: ui^. Wij for all 1 < i < g 

Initialize: Draw Sqj. uniformly on [-1,1] for all I < i < q, I < k < p 
repeat 

Compute &l^&n 

dnij ^ {&i&n}ii + {&i&n}n ^ 2{&i&„}ij for all 1 < i,j < q 
^nij <- Xij/dnij for all 1 < j ^ J < g 

<- J2j ^nij for all 1 < j < g 
Compute 0„{W - Z„) 

^n+l,fe ^ [''nfeK- + 2^n,.) + {0n(W - Z„ ) /{2t«,. ) for all 1 < i < p, 1 < fc < <? 
until convergence occurs 



CPU 



GPU 



QN(20) on CPU 



Dim-p 


Iters 


Time 


Stress 


Iters 


Time 


Stress 


Speedup 


Iters 


Time 


Stress 


Speedup 


2 


3452 


43 


198.5109307 


3452 


1 


198.5109309 


43 


530 


16 


198.5815072 


3 


3 


15912 


189 


95.55987770 


15912 


6 


95.55987813 


32 


1124 


38 


92.82984196 


5 


4 


15965 


189 


56.83482075 


15965 


7 


56.83482083 


27 


596 


18 


56.83478026 


11 


5 


24604 


328 


39.41268434 


24604 


10 


39.41268444 


33 


546 


17 


39.41493536 


19 


10 


29643 


441 


14.16083986 


29643 


13 


14.16083992 


34 


848 


35 


14.16077368 


13 


20 


67130 


1288 


6.464623901 


67130 


32 


6.464624064 


40 


810 


43 


6.464526731 


30 


30 


100000 


2456 


4.839570118 


100000 


51 


4.839570322 


48 


844 


54 


4.839140671 


n/a 



Table 4. Comparison of run times (in seconds) for MDS on the 2005 House of Representatives roll call data. The number of points (representatives) 
is q = 401. The results under the heading QN {20) on CPU invoke the quasi-Newton acceleration [39] with 20 secant conditions. 
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Fig 4. Display of the MDS results withp = 3 coordinates on the 2005 House of Representatives 
roll call data. 
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4. Discussion 

The rapid and sustained increases in computing power over the last half cen- 
tury have transformed statistics. Every advance has encouraged statisticians to 
attack harder and more sophisticated problems. We tend to take the steady 
march of computational efficiency for granted, but there are limits to a chip's 
clock speed, power consumption, and logical complexity. Parallel processing via 
GPUs is the technological innovation that will power ambitious statistical com- 
puting in the coming decade. Once the limits of parallel processing are reached, 
we may see quantum computers take off. In the meantime statisticians should 
learn how to harness GPUs productively. 

We have argued by example that high-dimensional optimization is driven by 
parameter and data separation. It takes both to exploit the parallel capabilities 
of GPUs. Block relaxation and the MM algorithm often generate ideal parallel 
algorithms. In our opinion the MM algorithm is the more versatile of the two 
generic strategies. Unfortunately, block relaxation does not accommodate con- 
straints well and may generate sequential rather than parallel updates. Even 
when its updates are parallel, they may not be data separated. The EM algo- 
rithm is one of the most versatile tools in the statistician's toolbox. The MM 
principle generalizes the EM algorithm and shares its positive features. Scoring 
and Newton's methods become impractical in high dimensions. Despite these 
arguments in favor of MM algorithms, one should always keep in mind hybrid 
algorithms such as the one we implemented for NNMF. 

Although none of our data sets is really large by today's standards, they do 
demonstrate that a good GPU implementation can easily achieve one to two 
orders of magnitude improvement over a single CPU core. Admittedly, modern 
CPUs come with 2 to 8 cores, and distributed computing over CPU-based clus- 
ters remains an option. But this alternative also carries a hefty price tag. The 
NVIDIA GTX280 GPU on which our examples were run drives 240 cores at a 
cost of several hundred dollars. High-end computers with 8 or more CPU nodes 
cost thousands of dollars. It would take 30 CPUs with 8 cores each to equal a 
single GPU at the same clock rate. Hence, GPU cards strike an effective and 
cost efficient balance. 

The simplicity of MM algorithms often comes at a price of slow (at best lin- 
ear) convergence. Our MDS, NNMF, and PET (without penalty) examples are 
cases in point. Slow convergence is a concern as statisticians head into an era 
dominated by large data sets and high-dimensional models. Think about the 
scale of the Netflix data matrix. The speed of any iterative algorithm is deter- 
mined by both the computational cost per iteration and the number of iterations 
until convergence. GPU implementation reduces the first cost. Computational 
statisticians also have a bag of software tricks to decrease the number of iter- 
ations [10, 11, 14, 20, 22, 23, 35]. For instance, the recent paper [39] proposes 
a quasi-Newton acceleration scheme particularly suitable for high-dimensional 
problems. The scheme is off-the-shelf and broadly applies to any search algo- 
rithm defined by a smooth algorithm map. The acceleration requires only mod- 
est increments in storage and computation per iteration. Tables 3 and 4 also 
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list the results of this quasi-Newton acceleration of the CPU implementation 
for the MDS and PET examples. As the tables make evident, quasi-Newton 
acceleration significantly reduces the number of iterations until convergence. 
The accelerated algorithm always locates a better mode while cutting run times 
compared to the unaccelerated algorithm. We have tried the quasi-Newton ac- 
celeration on our GPU hardware with mixed results. We suspect that the lack 
of full double precision on the GPU is the culprit. When full double precision 
becomes widely available, the combination of GPU hardware acceleration and 
algorithmic software acceleration will be extremely potent. 

Successful acceleration methods will also facilitate attacking another nag- 
ging problem in computational statistics, namely multimodality. No one knows 
how often statistical inference is fatally flawed because a standard optimiza- 
tion algorithm converges to an inferior mode. The current remedy of choice is 
to start a search algorithm from multiple random points. Algorithm accelera- 
tion is welcome because the number of starting points can be enlarged without 
an increase in computing time. As an alternative to multiple starting points, 
our recent paper [38] suggests modifications of several standard MM algorithms 
that increase the chance of locating better modes. These simple modifications 
all involve variations on deterministic annealing [34]. 

Our treatment of simple classical examples should not hide the wide appli- 
cability of the powerful MM-I-GPU combination. A few other candidate ap- 
plications include penalized estimation of haplotype frequencies in genetics [1], 
construction of biological and social networks under a random multigraph model 
[28], and data mining with a variety of models related to the multinomial dis- 
tribution [40]. Many mixture models will benefit as well from parallelization, 
particularly in assigning group memberships. Finally, parallelization is hardly 
limited to optimization. We can expect to see many more GPU applications in 
MCMC sampling. Given the computationally intensive nature of MCMC, the 
ultimate payoff may even be higher in the Bayesian setting than in the frequen- 
tist setting. Of course realistically, these future triumphs will require a great 
deal of thought, effort, and education. There is usually a desert to wander and 
a river to cross before one reaches the promised land. 
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